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non-standard  features.  Networks  trained  on  a  large  corpus  of  unsegmented,  continuous  speech  without 
labels  also  developed  interesting  feature  representations  that  may  be  useful  in  both  segmentation  and 
label  learning. .  The  results  of  these  studies,  while  preliminary,  demonstrate  that  back-propagation 
learning  can  be  used  with  complex,  natural  data  to  identify  a  feature  structure  that  can  serve  as  the 
basis  for  both  analysis  and  non-trivial  pattern  recognition.  These  findings  can  make  a  significant 
contribudon  to  the  long-standing  problems  of  speech  feature  decomposition  and  computer-based  speech 
recognition. 
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INTRODUCTION 

The  recognition  of  speech  is  one  of  many  things  that  is  carried  out  by  humans  with  apparent  ease, 
but  that  has  been  done  by  computers  only  at  great  cost,  with  high  error,  and  in  highly  constrained  situa¬ 
tions.  Whether  or  not  one  is  interested  in  machine-based  speech  recognition  per  se,  the  difficulties 
encountered  by  such  systems  may  be  diagnostic  of  flaws  in  the  theoretical  frameworks  that  motivate 
them.  We  believe  that  part  of  the  difficulty  in  such  systems  lies  in  the  use  of  inappropriate  features  as 
units  for  recognizing  and  representing  speech.  But  what  are  the  appropriate  units  and  how  are  they  to 
be  found?  In  this  paper  we  describe  studies  designed  to  determine  whether  these  units  can  be  learned. 
We  use  a  newly  developed  learning  procedure  for  artificial  neural  networks.  This  approach  not  only 
provides  a  way  to  learn  to  recognize  speech,  but  also  enables  the  required  computations  to  be  carried 
out  in  parallel  in  a  brain-like  fashion. 

The  question  "What  are  the  units  of  speech  perception?"  has  been  long-standing  and  controversial. 
The  problem  is  that  when  cme  looks  at  the  acoustic  waveform,  there  are  rarely  obvious  clues  as  to  the 
boundaries  between  segments,  let  alone  an  indication  of  what  those  segments  are.  The  speech  sound 
wave  varies  continuously  and  smoothly  over  time.  There  is  nothing  mysterious  or  unexpected  in  this;  it 
is  the  acoustic  consequence  of  the  fact  that  the  production  of  speech  involves  a  high  degree  of  coarticu¬ 
lation  (i.e.,  the  mutual  influence  of  neighboring  sounds)  with  smooth  transitions  from  one  sound  to 
another. 

While  one  has  the  impression  that  speech  is  made  up  of  concatenated  "sounds,"  just  what  those 
sounds  are  is  debatable.  At  least  eight  different  levels  of  representation  have  been  proposed  to  inter¬ 
vene  between  the  speech  wave  and  the  representation  "word";  spectral  templates  (Lowerre,  1976), 
features  (Cole,  Stem,  &  Lasry,  1986),  diphones  (Dixon  &  Silverman,  1976;  Klatt.  1980),  context- 
sensitive  allophones  (Wickelgren,  1969),  phonemes  (Pisoni,  1981),  demisyllables  (Fujimura  &  Lovins. 
1978),  syllables  (Mehler,  1981),  and  morphemes  (Aronoff,  1976;  Klatt,  1980).  While  these  are  all  rea¬ 
sonable  candidates  for  representing  speech,  the  problem  is  that  they  have  not  been  derived,  in  a  canoni¬ 
cal  way,  from  the  speech  data  itself.  Learning  provides  a  systematic  way  to  find  recognition  features  in 
data. 

Whether  or  not  the  representations  used  in  the  perception  of  speech  are  innate  or  learned  remains 
open,  and  we  do  not  wish  to  take  a  strong  position  on  this  issue.  However,  much  of  the  motivation  for 
supposing  that  internal  representations  in  perception  are  innate  has  come  from  the  apparent  poverty  of 
data  and  the  weakness  of  learning  algorithms.  Recent  developments  in  parallel  distributed  processing 
(PDP)  learning  algorithms  have  demonstrated  that  a  surprisingly  small  amount  of  data  may  contain  suf¬ 
ficient  cues  to  its  intrinsic  structure  so  that  rather  simple  learning  rules  can  be  used  to  induce  this 
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structure.  In  the  current  paper  we  attempt  to  demonstrate  the  consequences  of  requiring  that  representa¬ 
tions  be  leamable.  To  do  this  we  have  taught  trainable  networks  a  series  of  speech  recognition  tasks 
and  then  examined  the  internal  representations  that  are  generated.  The  networks  are  extremely  adept  at 
solving  the  recognition  tasks.  They  spontaneously  develop  their  own  representations  which  sometimes, 
but  not  always,  correspond  to  our  previous  categories  of  representational  units. 


BACK  PROPAGATION  OF  ERROR 

In  these  studies  we  use  a  form  of  the  "generalized  delta-rule"  known  as  "back  propagation  of  error" 
(Rumelhart,  Hinton,  &  Williams,  1986).  This  algorithm  provides  a  technique  for  training  a  multilayer 
network  to  associate  many  pain  of  patterns.  The  complete  set  of  pattern  pain  to  be  learned  can  be 
thought  of  as  the  extensional  definition  of  a  vector-valued  function  whose  domain  is  the  set  of  input 
patterns  and  whose  range  is  the  the  set  of  outputs. 

As  learning  schemes  become  more  powerful,  the  set  of  functions  that  can  be  learned  increases.  For 
example,  the  perception  convergence  procedure  (Rosenblatt,  1962)  can  program  networks  to  compute 
linear  Boolean  functions  such  as  AND  and  OR  but  not  nonlinear  ones  such  as  XOR.  The  early  generali¬ 
zations  of  the  perception  rule  that  extended  the  learning  set  to  patterns  with  continuous  rather  than 
Boolean  values  are  also  limited  to  learning  linear  functions.  Back  propagation,  on  the  other  hand,  can 
program  multilayer  networks  to  compute  all  the  Boolean  functions.  Since  it  is  applicable  to  patterns 
with  continuous  component  values,  back  propagation  can  also  deal  with  a  much  wider  range  of  func¬ 
tions. 

Much  of  the  significance  of  back-propagation  learning  stems  from  the  fact  that  it  is  defined  on  a 
neural-like  network.  An  example  of  such  a  network  is  shown  in  Figure  1.  The  output  of  each  unit  is  a 
function  of  the  weighted  sum  of  its  inputs.  It  is  these  weights  that  are  changed  as  learning  proceeds. 
Each  layer  in  the  back-propagation  network  can  have  any  number  of  units.  In  the  work  described  here 
only  strictly  layered  networks  are  used  in  which  each  unit  in  a  layer  receives  inputs  from  every  unit  in 
the  layer  below.  Training  proceeds  in  cycles.  In  each  cycle  a  pattern  pair  is  chosen  from  the  function 
definition.  The  input  pattern  is  applied  to  the  first  layer  of  the  network,  and  the  activity  it  generates  is 
passed  successively  to  the  other  layers  until  it  produces  a  pattern  of  activity  on  the  output  layer.  This 
output  pattern  vector  is  then  subtracted  from  the  correct  output  pattern  to  produce  an  error  pattern.  This 
error  in  turn  is  used  to  adjust  the  weights  in  the  output  layer  and  then  (by  back  propagation)  to  adjust 
the  weights  in  lower  layers.  The  exact  way  the  error  is  propagated  back  down  through  each  layer  con¬ 
stitutes  the  novel  part  of  the  learning  rule.  Discovering  just  how  to  do  this  correctly  was  the  significant 
accomplishment  of  Rumelhart,  Hinton,  and  Williams  (1986).  The  details  of  their  procedure,  which  we 
have  used  in  this  work,  are  given  in  the  caption  of  Figure  1. 

When  computations  are  programmed  inductively  in  this  manner,  values  from  the  range  of  the  func¬ 
tion  must  be  available.  Frequently  these  values  are  supplied  by  an  external  source  or  "teacher,”  which 
assumes  that  such  a  teaching  input  is  available;  however,  in  certain  situations,  this  assumption  may  be 
an  unrealistic.  This  raises  the  question  of  how  to  configure  a  system  so  it  can  learn,  either  without  an 
external  teacher  or  with  the  kind  of  information  more  realistically  available.  Several  solutions  have 
been  proposed.  One  of  the  simplest  and  most  elegant  of  these  is  to  use  teaching  patterns  that  are  the 
same  as  the  input  or  some  fixed  transformation  of  the  input  While  this  would  seem  to  limit  us  to 
learning  the  identity  function  (or  some  fixed  transformation  of  it),  it  has  been  shown  that  with  this  pro¬ 
cedure  the  hidden  units  leant  to  represent  the  input  patterns  in  terms  of  salient  features.  When  the 
number  of  hidden  units  is  less  than  the  number  of  input  units,  the  information  in  the  input  is 
represented  at  a  lower  dimensionality.  In  many  perceptual  problems  this  lower  dimensional  feature 
representation  is  just  what  is  needed  as  a  basis  for  further  processing.  We  use  this  approach  in  some  of 
the  speech  recognition  ,tudies  reported  here. 
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FIGURE  1.  Strictly  layered  back  propagation  networks  of  the  kind  depicted  in  this  figure  were  used  for  all  the  work  described  in 
this  paper.  Before  training  begins,  all  the  weights  and  biases  are  set  to  random  values  between  —1.0  and  1.0.  At  the  beginning  of 
each  learning  cycle,  i.e.,  at  time  t  «  0  for  that  cycle,  the  pattern  generator  provides  an  input  pattern  (x,  x2  •  ■  •  x4)  and  a  target  pat¬ 
tern  (yx  y2  •  •  •  yk).  At  time  I  ”  1  the  outputs  of  the  input  units  are  set  equal  the  input  pattern,  that  is:  /,*xt> 
f  2  •  xj,  x k.  The  output  activities  of  the  hidden  units  are  set  at  time  /  »  2  to 

;** 

Hi  (t+1)  -  squash  (biasH.  +  £  W„  y  fy  (T )) 


J-t 


where 


squash  (x)  > 


1  +  *- 


Similarly,  the  activity  of  the  output  units  are  set  at  r  »  3  to 

i-i 

0,  (r+1)  -  squash(bias0  £  W0  y  ffy(O). 

1  ;'”t  * 

All  these  values  are  held  constant  until  the  end  of  the  current  learning  cycle.  The  next  step  is  to  change  all  the  weights  and 
biases  according  to  the  back  propagation  rule.  At  <  »  4  the  output  unit  weights  are  changed  by  the  rule 

W0  i m n  8  o  H 

&bias0  «  q  S0 


q  •  learning  rate  constant. 

At  i  »  5  the  hidden  unit  weights  are  are  changed  by 
WH.  y  -  q  5„.  fy 
A biaStf'  * 


4-1 


7VV*;  prj  its  p»"a  ^  ■ 


I 
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For  the  derivation  of  these  rules  see  Rumelhart,  Hinton,  and  Williams  (1986). 
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PRELIMINARY  CONSIDERATIONS 

Before  back  propagation  could  be  used  to  recognize  speech  it  was  necessary  to  find  a  way  to  present 
the  sound  to  the  network.  The  speech  signal  itself  changes  with  time  but  the  networks  require  fixed 
input  and  target  samples.  The  approach  we  used  here  was  to  present  the  learning  network  with  fixed 
input  patterns  each  of  which  consists  of  a  set  of  sequential  samples.  In  some  cases  the  individual  sam¬ 
ples  in  the  input  pattern  were  Fourier  amplitude  spectra;  in  other  cases  the  actual  digitalized  sound 
samples  were  used.  Preliminary  studies  were  carried  out  to  find  the  workable  ranges  for  the  number  of 
frequencies  and  the  amount  of  time  that  had  to  be  represented  in  the  input  samples. 

Another  consideration  was  the  way  to  normalize  this  data.  The  dynamic  range  of  the  Fourier  spectra 
is  large,  and  the  vast  majority  of  the  points  have  very  low  values.  We  found  that  certain  versions  of 
the  learning  procedures  had  difficulty  with  this  type  of  input,  consisting  of  a  vast  sea  of  near  zero 
values  with  a  few  high  peaks.  The  situation  was  further  complicated  by  large  amplitude  differences 
between  examples  of  the  same  sound. 

We  used  two  different  strategies  in  preprocessing  the  input  data.  One  strategy  involved  finding  ad 
hoc  methods  to  deal  with  the  problems  raised  by  the  amplitude  and  dynamic  range.  The  other  strategy 
was  to  train  a  network  to  do  the  preprocessing  itself.  Details  of  these  strategies  will  be  given  later. 


DIRECT  LEARNING  OF  PHONETIC  LABELS 

In  our  first  series  of  studies  we  asked  how  a  back-propagation  network  might  solve  the  problem  of 
learning  to  label  a  set  of  highly  confusable  syllables.  The  basic  idea  was  to  use  a  spectrogram  of  a 
sound  as  the  input  pattern  and  a  target  bit  pattern  with  one  bit  position  for  each  of  the  types  of  sound. 
The  task  of  the  network  was  to  leant  to  set  the  output  unit  corresponding  to  the  sound  type  of  the  input 
to  1,  while  setting  all  the  other  output  units  to  zero.  We  chose  the  syllables  [ba],  [bi],  [bu],  [da],  [di], 
[du],  [ga],  [gi],  and  [gu]  because  this  set  of  three  voiced  stops,  paired  with  each  of  three  vowels,  is 
known  to  exhibit  a  high  degree  of  variability  due  to  coarticulation.  Although  listeners  readily  report  all 
versions  of  (for  example)  the  [d]  as  sounding  the  same,  the  acoustic  patterns  corresponding  to  the  con¬ 
sonant  differ  greatly  across  the  three  vowel  contexts.  Indeed,  this  represents  the  paradigm  case  of  per¬ 
ceptual  invariance  coupled  with  acoustic  variability. 

The  stimuli  for  the  experiment  were  prepared  as  follows.  A  single  male  speaker  recorded  a  set  of 
505  tokens  of  the  set  of  nine  syllables  (about  56  tokens  of  each  syllable).  Tokens  were  recorded  in  a 
moderately  quiet  environment,  but  with  no  particular  effort  at  eliminating  background  noise;  nor  was  an 
attempt  made  to  ensure  a  constant  rate  of  speech  or  uniformity  of  pronunciation.  Recording  was  earned 
out  through  analog-to-digital  conversion  at  a  10  kHz  sampling  rate  and  low-pass  Filtered  at  3.5  kHz. 

The  beginning  of  each  consonant  was  located  and  an  FFT  analysis  was  earned  out  over  6.4  ms 
frames,  advancing  3.2  ms  per  frame  for  20  frames.  The  output  of  the  FFT  was  reformatted  to  give 
spectral  magnitudes  over  16  frequency  ranges.  As  a  result,  each  token  was  represented  as  a  16  \  20 
array  of  positive  values.  Finally,  these  FFT  magnitudes  were  normalized  to  the  token  average  and 
"squashed"  using  the  logistic.  Examples  of  the  resulting  values  for  several  tokens  are  graphed  in  Figure 
2. 

The  network  used  consisted  of  320  input  units,  between  two  and  six  hidden  units  (in  different  condi¬ 
tions),  and  as  many  output  units  as  there  were  types  to  be  labeled.  The  network  used  here  and 
throughout  this  work  was  strictly  layered;  i.e.,  every  unit  of  the  input  and  hidden  layer  was  connected  to 
every  unit  on  the  layer  above. 

The  data  set  of  sounds  was  divided  into  two  equal  parts.  One  was  used  for  training  and  the  other 
was  kept  for  testing  performance  on  untrained  examples.  On  each  cycle  of  the  teaching  phase  the  fol¬ 
lowing  occurred;  (a)  One  of  the  syllables  from  the  training  set  was  selected  at  random  and  applied  to 
the  input  layer;  (b)  activation  was  propagated  up  from  the  input  layer  to  the  hidden  layer,  and  from  the 


FIGURE  2.  Graphs  of  examples  of  the  nine  sounds  [bal,  [bi],  [bu],  [da],  [dal,  [du],  [ga],  [ga],  [gi],  and  [gu].  To  generate  this  data 
the  release  of  each  stop  was  located  under  computer  control,  and  syllables  wen  edited  to  include  5  ms  before  the  release  and  suffi¬ 
cient  time  following  release  to  allow  64  ms  of  FFT.  A  64-point  FFT  was  carried  out  over  Hamming-windowed  frames,  advancing 
32  points  per  frame.  Each  token  consisted  of  an  array  of  twenty  3.2  ms  frames;  within  each  frame  the  FFT  output  was  reformatted 
to  give  spectral  magnitudes  over  16  evenly  spaced  frequency  ranges,  then  normalized  to  the  average  for  the  token.  These  values 
were  then  transformed  using  the  logistic  function  y=  1/(1  -  e~^rpl.  This  procedure  maps  all  values  to  the  range  0  to  1,  and  the 
average  to  about  0.46.  Since  the  median  for  the  amplitude  spectra  is  generally  just  a  bit  greater  than  the  average  this  procedure 
guarantees  that  about  half  of  all  values  are  mapped  to  each  side  of  0.3,  which  is  beneficial  for  back-propagation  learning. 

hidden  layer  to  the  output  layer;  (c)  the  error  for  each  output  unit  was  calculated  and  back  propagated 
through  the  network,  ard  the  weights  were  adjusted  according  to  the  learning  rule. 

The  trained  network  was  tested  by  scanning  through  all  members  of  the  training  and  naive  data  sets. 
The  result  for  a  token  was  scored  as  correct  only  if  the  value  of  the  output  unit  that  should  be  1  was 
greater  than  0.5  and  the  values  of  all  the  other  output  units  were  less  than  0.5. 

Three  versions  of  simple  phonetic  labeling  were  run.  In  one,  nine  labels  were  used,  corresponding  to 
the  nine  sound  types.  In  the  other  two,  three  labels  were  used,  corresponding  either  to  the  three  vowels 
(ignoring  the  consonant  in  the  same  syllable)  or  else  to  the  three  consonants  (ignoring  the  vowels).  In 
all  cases  more  than  100,000  training  cycles  were  run.  This  extensive  training  was  used  because  we 
were  interested  in  ultimate  performance. 

In  all  three  versions,  the  networks  always  were  able  to  leant  to  label  the  training  set  perfectly;  that  is, 
there  were  no  errors  in  the  classification  of  either  sound  type  (the  syllable,  the  vowel,  or  the  consonant). 
When  presented  with  the  data  from  the  naive  set,  the  network  trained  to  label  whole  syllables  made  an 
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average  of  16%  errors.  The  networks  trained  to  label  vowels  and  consonants  made  an  average  of  1.5% 
and  7.9%  errors,  respectively.  The  results  for  the  vowels  and  consonants  were  about  the  same  whether 
two  or  three  hidden  units  were  used. 

We  found  that  the  performance  could  be  improved  by  introducing  certain  kinds  of  noise  into  the 
samples.  Simply  adding  random  noise  to  the  inputs  degraded  performance.  However,  if  the  samples 
were  distorted  by  adding  noise  proportional  to  each  value,  performance  was  significantly  improved. 
This  random  distortion  was  accomplished  by  replacing  each  input  value  x  by  (x  +xr ),  where  r  is  ran¬ 
domly  chosen  from  the  range  -0.5  to  0.5.  When  tested  without  noise  the  training  sets  still  learned  per¬ 
fectly.  On  naive  data,  the  performance  for  syllables,  vowels,  and  consonants  was  10%,  0.3%  and  5.0% 
errors,  respectively.  This  means  that  a  recognizer  based  on  vowels  and  consonants  would  have  an  accu¬ 
racy  of  about  95%. 

There  are  several  reasons  why  training  in  noise  should  increase  the  model’s  ability  to  generalize. 
First,  the  noise  effectively  expands  the  data  set;  each  syllable  is  represented  by  a  larger  number  of 
exemplars.  Second,  and  probably  more  important,  the  introduction  of  noise  helps  to  blur  stimulus 
ideosyncracies  that  might  be  learned  in  place  of  the  phonetically  valid  features.  This  results  in  greater 
error  during  the  teaching  phase,  but  better  generalization. 

Now  let  us  turn  to  the  hidden  units.  They  restructure  the  input  patterns  in  such  a  way  as  to  provide 
input  for  the  final  (output)  layer.  In  the  process  of  carrying  out  this  mapping,  they  encode  the  input 
patterns  as  feature  types.  One  can  ask  what  sorts  of  features  become  represented  in  the  hidden  units  as 
a  result  of  the  teaching  phase.  These  internal  representations  may  provide  a  clue  as  to  how  the 
phonetic  categorization  is  accomplished. 

In  order  to  visualize  the  relationship  between  hidden  unit  activity  and  input  sound  type  we  used  a 
technique  that  displays  the  average  activity  of  each  hidden  unit  at  a  different  spatial  position  for  each 
sound  type.  Examples  of  the  hidden  unit  activity  patterns  obtained  in  this  way  are  shown  in  Figure  3. 
Every  hidden  unit  has  become  absolutely  correlated  with  a  subset  of  sound  types.  Hidden  units  have 
outputs  of  1  for  some  sound  types  in  this  set  and  0  for  others.  In  addition  some  hidden  units  produce  a 
wide  range  of  output  values  for  tokens  not  in  the  absolute  correlation  set.  The  correlation  subsets  can 
be  vowel-like  or  consonant-like,  in  that  a  unit  is  completely  on  or  completely  off  for  some  consonant  or 
vowel.  In  the  example  illustrated  for  the  nine  label  case,  for  example,  two  of  the  hidden  units  are 
vowel-like  and  two  consonant-like.  Not  infrequently  a  unit  cleanly  represents  a  single  vowel  or  con¬ 
sonant  in  its  on  activity.  It  is  interesting  that  each  time  the  learning  procedure  is  rerun,  using  different 
random  initial  weights,  a  different  pattern  of  hidden  unit  correlations  is  observed.  However,  while 
several  unit  patterns  occur  often,  some  never  appear  at  all.  For  example,  no  hidden  unit  has  ever  been 
found  that  that  represents  the  [u]  sound  alone  by  an  on  unit.  This  contrasts  with  [a]  and  [i]  which  can 
be  so  represented. 

Another  version  of  label  learning  was  carried  out  in  which  a  larger  number  of  phonetic  labels  was 
employed,  reflecting  a  finer-grain  phonetic  analysis.  Each  of  the  nine  syllables  was  divided  into  a  con¬ 
sonantal  portion  and  a  vocalic  portion.  The  consonantal  stimulus  corresponded  to  the  First  32  ms  of  the 
syllable  (starting  5  ms  before  release  of  closure)  and  the  vocalic  stimulus  corresponded  to  the  32  ms  of 
the  syllable  that  occur  150  ms  after  the  release  of  closure.  This  yielded  18  new  stimuli.  Each  of  the  18 
stimuli  types  was  given  its  own  digital  label,  with  labels  randomly  assigned  to  nine-bit  codes.  A  net¬ 
work  consisting  of  320  input  units,  6  hidden  units,  and  9  output  units  was  trained  on  1,000.000  learning 
cycles  of  these  1010  (505  x  2)  stimuli. 

The  correlations  between  hidden  unit  activity  and  sound  type  is  displayed  in  Figure  4.  There  are  6 
columns,  corresponding  to  the  6  hidden  units,  and  18  rows,  corresponding  to  the  18  stimulus  types. 
The  phonetic  segment  is  indicated  by  an  upper  case  letter,  and  its  context  by  a  lower  case  letter.  Thus, 
"Ba"  refers  to  tokens  of  a  voiced  bilabial  stop,  extracted  from  the  syllable  [ba] ;  whereas  "dl"  refers  to 
tokens  of  a  high  front  vowel,  extracted  from  syllable  [di]. 

Figure  4  allows  us  to  look  at  the  internal  representation  that  has  been  developed  in  order  to  encode 
the  1010  tokens  as  18  phonetic  types.  The  representation  is  interesting  in  several  respects.  First,  we 
see  that  one  hidden  unit  (Unit  4)  is  always  on  for  the  first  nine  types,  and  off  for  the  last  nine  types.  It 
thus  serves  as  a  Consonant/Vowel  detector.  Note  that  the  learning  task  has  not  explicitly  required  that 
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FIGURE  3.  Panels  A-E  display  hidden  unit  response  patterns  for  different  versions  of  the  back  propagation  phonetic  labeling  net¬ 
works.  Each  column  shows  the  behavior  of  a  single  hidden  unit  for  ail  nine  sounds.  The  activity  of  the  units  is  coded  in  the 
degree  of  darkening  of  the  rectangle  associated  with  each  sound.  A  completely  black  rectangle  indicates  a  unit  with  average  activity 
of  about  1.0  for  that  sound.  Likewise  a  white  rectangle  (not  delineated  against  the  background)  indicates  an  average  activity  near 
0.0.  The  shaded  rectangles  indicate  intermediate  average  activities.  Panel  A  is  from  a  four  hidden  unit  network  trained  wuh  nine 
labels  signifying  syllables.  Panels  B  and  D  are  from  networks  trained  with  three  vowel  labels,  while  C  and  E  were  trained  to 
recognize  three  consonants. 
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FIGURE  4.  A  graph  of  the  hidden  unit  activity  associated  with  each  of  the  18  speech  inputs.  Each  column  shows  the  behavior  of 
a  single  hidden  unit  for  all  18  speech  sounds.  Upper  case  letters  indicate  which  portion  of  a  CV  syllable  was  presented  (consonant 
or  vowel);  lower  case  letters  indicate  the  context 


this  distinction  be  drawn.  The  network  has  simply  been  asked  to  learn  18  phonetic  labels.  It  happens 
that  the  Consonant/Vowel  is  a  useful  dimension  along  which  to  classify  types,  and  this  dimension  is 
implicit  in  the  stimuli.  In  other  cases  we  see  that  it  is  not  as  easy  to  interpret  single  hidden  units  by 
themselves.  When  both  Unit  2  and  Unit  4  are  on,  a  velar  stop  (Ga,  Gi,  Gu)  is  signaled;  otherwise,  the 
vowel  [i]  is  indicated. 

One  very  striking  result  is  the  response  pattern  for  Unit  0.  This  unit  is  always  on  (and  only  oni  for 
the  alveolar  stops  (Da,  Di,  Du).  What  makes  this  so  surprising  is  that  the  alveolar  stops  exhibit  a  great 
deal  of  acoustic  variability  across  different  vowel  contexts.  The  task  simply  required  that  the  network 
Ieam  labels  for  the  three  different  alveolar  allophones;  it  was  not  required  to  group  these  together  in 
any  way  (indeed,  there  was  no  feedback  that  informed  the  network  that  any  relationship  existed 
between  these  three  types).  Nonetheless,  the  network  has  spontaneously  determined  a  representation  in 
which  these  allophones  are  grouped  together. 

The  weights  connecting  the  input  to  the  hidden  units  are  a  kind  of  filter  through  which  the  sound 
stimuli  pass  to  determine  hidden  unit  activity.  The  shape  of  this  filter  is  indicative  of  the  sound 
features  recognized  by  the  hidden  units.  Examining  these  weight  profiles  can  give  us  some  understand¬ 
ing  of  these  features.  Figure  5  shows  graphs  of  the  input  weights  for  hidden  units  with  outputs  at  1  for 
only  a  single  sound  or  a  pair  of  sounds.  In  the  column  on  the  left  the  weights  for  several  vowel- 


I 


LEARNING  THE  HIDDEN  STRUCTURE  OF  SPEECH 


FIGURE  5.  These  ire  graphs  of  the  weights  connecting  the  input  sound  array  to  the  hidden  units.  Each  graph  represents  a  single 
hidden  unit.  The  weights  are  plotted  at  positions  that  correspond  to  the  frequency  and  time  of  their  attached  input.  The  format  of 
time  and  frequency  in  these  graphs  is  the  same  as  in  Figure  2.  The  hidden  units  are  all  from  networks  trained  to  recognize  either 
vowels  or  consonants.  The  hidden  units  are  on  only  for  the  sounds  indicated  in  the  lower  right-hand  comer  of  each  graph.  The  left 
hand  column  has  complete  graphs.  The  other  two  columns  have  flood  graphs,  representing  only  that  portion  of  the  graph  above  a 
high  tide  of  0.7  the  whole  range  of  weight  values.  All  values  below  the  high  tide  value  are  set  to  the  high  tide  value.  The  center 
column  is  based  on  the  same  data  as  the  one  to  the  left  The  complete  graphs  for  the  right-hand  column  are  not  shown. 


recognizing  units  are  depicted.  The  patterns  are  very  complex  and  little  can  be  gleaned  from  them.  In 
the  next  column  a  more  interpretable  "flood"  plot  of  the  same  data  is  shown;  the  flood  plot  shows  only 
those  peaks  above  some  "high  tide"  level.  The  important  differences  between  the  weight  arrays  become 
apparent,  showing  some  of  the  basis  for  distinguishing  between  the  various  sounds.  The  last  column  on 
the  right  is  a  flood  plot  of  some  consonant-recognizing  units.  The  flood  plots  of  the  vowel-recognizing 
units  and  the  consonant-recognizing  units  reveal  only  part  of  the  story.  The  negative  peaks  are  also 
important  in  the  recognition  process.  And  the  importance  of  the  finer  scale  structure  of  the  weight 
matrix  is  not  yet  known. 

Results  from  these  studies  of  phonetic  label  learning  indicate  that  this  approach  has  considerable 
power  and  can  be  successful  even  given  a  highly  confusable  set  of  stimuli.  Furthermore,  the  back- 
propagation  technique  results  in  internal  representations  that  have  interesting  properties. 

One  important  difficulty  with  this  approach  is  the  origin  of  the  phonetic  labels.  The  direct  teaching 
technique  requires  that  for  each  speech  stimulus  the  correct  label  be  known.  It  would  seem  desirable 
not  to  have  to  make  this  assumption.  For  example,  from  the  viewpoint  of  child  language  acquisition  we 
are  put  in  a  bit  of  a  teleological  quandry  if  we  must  assume  that  children  know  the  labels  of  the  sounds 
they  are  learning,  before  they  learn  them.  This  consideration  led  us  to  investigate  identity  mapping  as  a 
way  to  leam  phonetic  features. 
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IDENTITY  MAPPING  OF  SPECTROGRAMS 

In  the  previous  study,  the  input  and  output  patterns  were  different;  the  input  was  a  known  speech 
stimulus  and  the  output  was  its  abstract  phonetic  label.  This  interpretation  of  input  and  output  are  nei¬ 
ther  available  nor  necessary  to  the  operation  of  the  learning  algorithm.  It  is  simply  learning  a  function 
that  relates  two  patterns.  One  can  apply  the  learning  algorithm  in  a  different  mode  in  which  the  input 
and  output  pattern  are  the  same.  This  mode,  which  we  call  identity  mapping  (it  is  also  known  as  auto- 
association  [Ackley,  Hinton,  &  Sejnowksi,  1985]),  does  not  require  an  external  teacher.  Identity  map¬ 
ping  a  large  pattern  via  a  layer  of  a  few  hidden  units  has  been  shown  to  yield  useful  internal  representa¬ 
tions  that  give  explicit  information  about  the  structure  of  the  input  patterns  (Cottrell.  Munro,  &  Zipser. 
in  press;  Zipser,  in  press).  We  have  used  identity  mapping  with  the  same  sounds  descnbed  above  to 
see  if  useful  internal  representations  can  be  learned  in  the  absence  of  a  priori  knowledge  about  the 
"meaning"  or  "names"  of  patterns.  We  find  that  the  hidden  units  in  idenuty  mapping  come  to  represent 
both  previously  identified  speech  features  and  new,  not  easily  described,  features. 

The  stimuli  for  this  experiment  were  identical  to  those  used  in  previously.  They  consisted  of  505 
tokens  of  the  nine  consonant-vowel  (CV)  syllables,  represented  by  normalized  and  squashed  power 
spectra.  The  network  had  320  input  units,  between  2  and  10  hidden  units,  and  320  output  units.  The 
training  phase  was  similar  to  the  labeling  studies,  except  that  the  target  output  pattern  was  always  ident¬ 
ical  to  the  input  pattern. 

In  Figure  6A  we  see  an  example  of  the  hidden  unit  activations  that  developed  after  about  150,000 
learning  cycles.  Unit  3  is  a  vowel  unit  since  it  is  strongly  on  for  all  [a]  and  off  for  the  other  vowels. 
Units  2  and  4  are  consonant-like  units  since  they  are  on  quite  strongly  for  two  consonants  and  off  for  j 
thud.  Unit  3  is  vowel-like  but  encodes  some  consonant  information  also.  Units  5  and  6  cannot  be 
characterized  in  terms  of  vowels  and  consonants;  they  represent  some  feature  that  is  not  easily 
descnbed. 

Different  hidden  unit  activation  patterns  are  obtained  on  each  independent  run  of  the  same  learning 
problem,  but  the  same  general  kinds  of  hidden  units  are  found.  Sometimes  hidden  units  represent  a  sin 
gle  vowel  or  consonant  More  often  they  represent  a  strongly  correlated  encoding  of  mixed  sound  types 
as  is  the  case  with  Unit  1.  Units  like  5  and  6,  which  recognize  some  enigmatic  feature  are  also  quite 
common.  In  general,  the  fewer  the  number  of  hidden  units,  the  more  strongly  correlated  with  sound 
types  they  become. 

While  the  identity  mapping  network  we  have  described  was  trained  on  speech  that  was  not  phonen 
cally  labeled,  the  speech  tokens  had  been  laboriously  presegmented  into  syllables  It  occurred  to  us  that 
the  identity  mapping  network  might  be  able  to  segment  continuous  speech.  The  reason  this  might  be 
possible  is  that  error  would  be  expected  to  be  at  a  minimum  when  the  sounds  used  for  training  were  in 
register  on  the  input  units.  These  error  minima  would  then  signal  the  boundaries  between  syllables. 

To  test  this  possibility  we  synthesized  a  pseudo-continuous  speech  by  stringing  together  examples  ot 
the  nine  sound  types  in  random  order  and  shifting  this  sequence  through  the  input  one  time-step  per 
cycle.  This  resulted  in  an  stimulus  that  had  a  complete  sound  token  correctly  in  register  with  the  inputs 
only  once  every  20  cycles.  On  all  the  rest  of  the  cycles  the  input  consisted  ot  part  of  the  end  of  one 
token  and  part  of  the  beginning  of  another.  Networks  that  had  been  fully  trained  to  identit\  map 
presegmented  sounds  were  used  for  this  study,  but  their  learning  mechanism  was  turned  off.  On  each 
shift  cycle  the  total  error  was  computed.  (This  error  is  just  the  sum-squared  difference  between  the 
input  and  output  patterns.)  The  results  are  shown  in  Figure  6B. 

The  error  signal  has  a  clear  periodic  component,  decreasing  to  an  identifiable  minimum  each  time  a 
single  token  is  in  register  with  the  input.  The  reason  for  this  is  that  when  a  CV  syllable  is  in  svnc.  the 
network  "recognizes"  one  of  the  input  patterns  it  has  been  trained  previously  on  This  results  in  low 
error.  On  the  next  testing  cycle,  the  shifted  input  pattern  sail  resembles  one  of  the  learned  patterns  v 
error  is  relatively  low,  but  as  the  shifting  stimulus  gets  increasingly  out  of  registration,  the  emu 
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FIGURE  6.  A.  Hidden  unit  response  patterns  from  a  net  wort  that  had  been  trained  to  identity  map  the  nine  syllables  listed  on  the 
left  side.  The  average  activity  values  are  encoded  in  the  same  way  as  in  Figure  3.  B.  Stnp-chart-like  plots  of  the  tout  error  from 
an  identity  mapping  network  as  sound  tokens  are  continuously  shifted  through  the  input  space  as  described  in  the  text.  The  net¬ 
work  is  the  same  one  whose  hidden  unit  activities  are  shown  in  A  above.  The  "<"  marks  indicate  times  when  syllables  are  in  regis¬ 
ter  in  the  input  space.  The  lower  panel  is  a  continuation  of  the  right  end  of  the  upper  one. 

increases  The  results  of  this  simplified  study  indicate  that  identity  mapping  networks  could  be  used  for 
segmenting  continuous  speech.  One  can  also  envisage  using  multiple  networks  to  process  speech  in 
parallel.  A  network  trained  on  identity  mapping  could  be  used  to  locate  syllable  boundaries;  an  error 
minimum  could  then  be  used  to  activate  analysis  by  a  second  network  that  had  been  trained  to  do 
phonetic  labeling. 

We  have  seen  that  identity  mapping  can  be  used  to  leam  salient  features  without  labeling  and  may 
also  be  useful  in  segmenting  speech.  But  we  are  limited  by  the  need  to  presegment  the  input  for  train¬ 
ing  purposes.  This  is  an  undesirable  limitation  because  it  requires  a  teaching  environment  that  may  be 
richer  than  that  available  to  the  human  learner.  In  the  next  section  we  try  to  remove  the  requirement 
for  presegmentation  of  the  sound  stimulus. 
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IDENTITY  MAPPING  OF  CONTINUOUS  SPECTROGRAMS 


The  goal  in  this  study  was  to  see  what  son  of  representation  might  result  if  a  network  were  trained 
on  continuous  speech.  As  in  the  previous  study,  the  task  is  to  identity  map  the  input.  In  this  case,  we 
drop  the  restriction  that  the  input  must  correspond  to  a  CV  syllable.  Instead,  the  speech  input  consisted 
of  a  corpus  of  15  minutes  of  running  speech.  A  text  was  created  that  contained  (a)  the  digits  from  0  to 
9,  (b)  the  500  most  frequent  words  from  the  Kucera-Francis  corpus  (Kucera  &  Francis,  1967),  (c)  a 
phonetically  balanced  word  list  of  100  items,  and  (d)  a  prose  passage.  The  corpus  was  read  in  a 
conversational  manner  by  a  male  speaker  at  a  moderate  rate  under  relatively  quiet  conditions,  filtered  at 
3.5  kHz,  and  digitized  at  a  10  kHz  sampling  rate.  The  speech  was  hamming- windowed  and  analyzed 
by  a  128-point  FFT  using  overlapping  windows  that  advanced  64  points  per  frame.  The  power  spectra 
were  reduced  to  32  frequency  bins,  normalized,  and  squashed  in  a  manner  similar  to  that  described 
above. 

The  network  was  made  up  of  three  layers;  the  input  layer  contained  640  units,  the  hidden  layer  had  8 
units,  and  the  output  layer  contained  640  units. 

Given  the  magnitude  of  the  corpus,  the  computational  requirements  of  learning  such  a  database  to  an 
acceptable  level  of  error  are  considerable.  Pilot  studies  ran  for  approximately  2  weeks  on  a  VAX 
1 1/750  digital  computer  (with  FPA).  For  this  reason  we  carried  out  further  studies  on  the  Cray  XMP-4 
computer  of  the  San  Diego  Super  Computer  Center. 

The  network  was  trained  on  the  corpus  for  1,000,000  learning  cycles.  We  experimented  with  two 
modes  of  presentation.  In  one  mode,  the  speech  was  passed  through  the  input  layer  of  the  network  in  a 
continuous  fashion;  after  each  learning  cycle  the  input  layer  of  the  network  was  advanced  by  one  time 
interval  so  that  there  was  considerable  overlap  from  one  cycle  to  the  next.  In  the  second  mode,  a  sec¬ 
tion  of  the  corpus  was  selected  at  random  for  identity  mapping;  eventually  all  possible  (overlapping) 
portions  of  the  utterance  were  seen  by  the  network.  In  pilot  work,  we  found  no  differences  between  the 
two  modes;  the  random  mode  is  the  one  we  adopted  for  our  studies. 

At  the  completion  of  the  learning  phase,  the  network  had  been  trained  on  an  extensive  body  of 
speech.  The  speech  corpus  contains  approximately  140,000  different  input  patterns  (each  pattern  con¬ 
sisting  of  640  numbers).  Our  hope  was  that  this  sample  was  both  representative  of  the  variety  of 
speech  patterns  for  the  speaker,  while  at  the  same  time  containing  enough  regularity  that  the  network 
would  be  able  to  successfully  encode  the  patterns.  One  graphic  view  of  the  representation  that  is  built 
up  is  shown  in  Figure  7.  At  the  bottom  of  the  figure  we  see  a  spectrogram  of  a  section  of  the  training 
I  corpus.  Shown  above  it  are  eight  lines  that  graph  the  activations  of  the  hidden  units  when  the  speech 

shown  at  the  bottom  is  passed  through  the  network.  These  plots  can  be  thought  of  as  a  kind  of  feature 
representation  of  the  speech.  It  is  clear  that  features  have  steady-states  that  last  for  roughly  syllable¬ 
sized  periods  of  time.  It  has  proved  difficult,  however,  to  give  an  interpretation  of  the  content  of  these 
features.  An  important  question  is  how  much  information  has  been  preserved  by  the  encoding  contained 
in  the  eight  hidden  units.  One  can  test  this  by  seeing  whether  it  is  possible  to  teach  a  network  the 
I  phonetic  labels  for  speech  sounds  when  we  use  the  hidden  unit  representation  of  the  identity  mapped 

speech  rather  than  the  speech  itself.  This  involves  two  steps.  First,  we  take  the  nine  CV  syllables  used 
before,  pass  them  through  the  network  previously  trained  on  the  Cray  (using  the  15-minute  speech 
corpus),  and  then  save  the  hidden  unit  activations  that  result.  In  the  second  step,  we  use  the  hidden  unit 
activations  to  tram  a  second  network  to  label  the  activation  patterns  as  [ba],  [bi],  [bu],  etc.  This  step  is 
analogous  to  the  previous  labeling  studies,  with  the  important  difference  that  the  input  now  is  not 
I  speech  but  the  representation  of  speech  derived  from  the  Cray-trained  network.  The  labeling  network 

’  had  560  input  units  (to  accommodate  70  time  slices,  each  time  slice  lasting  6.4  ms  and  being 

represented  by  eight  hidden  unit  values),  8  hidden  units,  and  9  output  units.  The  nine  output  units  were 
used  to  encode  the  nine  different  syllable  types. 

After  approximately  100,000  learning  cycles  the  hidden  unit  activity  of  the  labeling  network  had  a 
reasonably  distinct  pattern  that  distinguished  the  nine  different  syllables.  A  more  rigorous  test  is  to  see 
|  how  many  categorization  errors  are  made  by  this  network.  There  is  an  overall  error  rate  of  13.5%  (false 
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FIGURE  7.  Hidden  unit  activity  from  a  network  that  has  been  trained  on  a  large,  unsegmented  speech  corpus.  The  output  activity 
of  the  eight  hidden  units  is  plotted  on  strip  chart-like  graphs  as  a  sample  of  the  speech  corpus  is  shifted  through  the  input  space  of 
the  network.  The  bottom  segment  of  the  figure  shows  an  FFT  of  the  sound  to  which  the  hidden  units  are  responding.  The  vertical 
timing  lines  are  600  ms  apart.  The  numbers  1-8  on  the  left  side  indicate  the  individual  hidden  units. 


reject  -  10.4%,  false  accept  -  3.1%),  but  most  of  it  is  due  to  a  very  high  error  rale  in  labeFng  [ga].  If 
this  syllable  is  removed,  the  overall  error  rate  drops  to  5.8%. 

This  result  is  quite  encouraging.  It  indicates  that  a  degree  of  dimension  reduction  has  been  achieved 
by  the  corpus  trained  network  using  only  eight  hidden  units,  without  an  enormous  loss  of  information. 
The  encoded  representation  is  rich  enough  that  the  identity  of  the  original  speech  can  still  be  extracted. 
Furthermore,  we  also  see  that  we  were  able  to  create  a  feature  representation  without  segmenting  the 
data  or  knowing  its  phonetic  identity.  Indeed,  features  obtained  in  this  way  may  eventually  be  useful 
for  the  task  of  segmentation. 

One  assumption  we  made  in  these  studies  was  that  the  power  spectra  of  the  speech  provides  a  good 
initial  representation  of  the  speech.  This  is  not  an  unreasonable  assumption,  and  there  are  in  fact  many 
additional  assumptions  one  might  have  made  (such  as  the  use  of  critical  bands,  pre-emphasis  of  higher 
frequencies,  etc.).  Nonetheless,  an  important  goal  of  this  investigation  has  been  to  see  how  much  of  the 
structure  of  speech  could  be  discovered  with  minimal  a  priori  assumptions  about  what  were  meaningful 
transformations  or  representations  of  the  data. 

In  this  spirit,  we  wondered  what  might  happen  if  we  abandoned  the  use  of  the  FFT  to  train  the  net¬ 
work.  Suppose  we  simply  presented  the  network  with  the  unanalyzed  digitalized  waveforms?  This  is 
what  we  did  in  the  final  study. 


IDENTITY  MAPPING  OF  CONTINUOUS  RAW  SPEECH 

In  this  study  we  attempted  to  see  whether  a  useful  input  representation  could  be  built  up  from  con¬ 
tinuous  speech  using  the  digital  waveform  itself. 

The  first  speech  corpus  we  used  consisted  of  the  simple  sentence,  This  is  the  voice  of  the  neural  net¬ 
work  (with  a  duration  of  approximately  4  seconds).  The  speech  was  kept  as  a  series  of  16-bit  samples. 
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each  value  representing  an  A/D  converter  voltage,  with  samples  occurring  at  100  microsecond  intervals. 
The  network  was  made  of  50  input  units,  20  hidden  units,  and  50  output  units.  As  in  the  previous 
experiment,  random  sections  of  the  corpus  were  selected  and  presented  as  input  and  target  for  identity 
mapping.  Since  each  such  section  contained  50  samples,  the  network’s  input  window  covered  5  ms  of 
speech. 

The  use  of  pulse  code  modulated  (PCM)  speech  in  identity  mapping  makes  it  very  easy  to  test  the 
network’s  performance  simply  by  collecting  the  output  and  converting  it  back  to  analog  form.  This 
requires  that  the  output  layer  contain  linear  units;  so,  while  the  hidden  layer  remained  nonlinear,  the 
output  layer  was  linear. 

After  one  million  learning  cycles,  we  froje  the  weight  values  and  fed  the  whole  training  data  set 
through  the  network  as  a  sequential  stream  of  nonoverlapping  50  sample  inputs.  The  output  was  con¬ 
verted  to  analog  form  and  played  over  speakers.  The  result  was  high-quality  speech;  spectrograms  of 
both  the  input  and  output  are  shown  in  Figure  8. 

We  were  interested  in  seeing  how  well  the  network  weights  would  generalize  to  novel  speech.  To 
test  this,  we  retrained  the  network  using  4  minutes  of  the  full  speech  data  base  that  was  used  in  the 
Cray  training  study  (but  in  PCM  form,  rather  than  as  power  spectra).  We  reasoned  that  this  larger  and 
more  varied  training  set  would  be  needed  in  order  to  learn  features  that  would  have  general  applicabil¬ 
ity.  Learning  proceeded  for  one  million  cycles,  using  the  same  presentation  method  as  with  the  simple 
sentence.  It  is  worth  noting  that  because  this  data  set  contained  2.5  million  different  50-sample  patterns 
less  than  half  the  data  was  seen,  and  any  pattern  that  was  presented  was  typically  seen  only  once. 

The  resultant  network  was  then  used  as  a  filter  for  the  original  neural  network  sentence.  A  spectro¬ 
gram  of  the  output  is  shown  in  Figure  9;  it  is  somewhat  degraded  compared  with  the  filter  that  is 
trained  on  the  sentence  itself,  but  is  still  quite  understandable. 

One  natural  question  to  ask  is  what  kind  of  encoding  the  hidden  units  have  discovered.  In  Figure  10 
we  see  spectrograms  of  the  outputs  of  the  individual  units  in  response  to  the  neural  network  sentence. 
It  is  clear  that  the  responses  do  not  resemble  single  sines  or  cosines,  showing  that  the  units  have  not 
learned  a  Fourier  decomposition.  One  thing  that  is  striking  is  the  extent  to  which  the  hidden  units’ 
spectral  responses  are  similar.  If  one  compares  the  spectrogram  of  the  input  sentence  itself  (Figure  8) 
with  those  of  the  hidden  units,  one  sees  the  way  in  which  this  is  so.  Most  of  the  hidden  unit  response 
frequencies  tend  to  center  around  the  regions  of  the  spectrum  that  are  relevant  for  speech  (this  is  not  the 
case  for  all  units;  there  are  some  that  are  distincdy  different).  The  units  have  thus  concentrated  mainly 
on  those  areas  of  the  spectrum  that  are  relevant  for  encoding  the  speech  data.  The  results  obtained  here 
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FIGURE  S.  Spectrograms  of  both  the  input  and  output  a  network  trained  to  identity  map  PCM  speech. 
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FIGURE  9.  A  spectrogram  of  the  output  of  tbe  network  of  Figure  8  given  u  input  the  "neural  network"  sentence  that  it  had  never 


with  sound  are  analogous  to  those  found  for  visual  images  by  Cottrell,  Munro,  and  Zipser  (in  press). 
Since  these  authors  were  able  to  demonstrate  considerable  bandwidth  compression  using  the  hidden  unit 
representations,  we  would  expect  that  bandwidth  compression  has  also  occurred  for  sound. 

The  time  scale  of  the  hidden  unit  features  is  quite  short  (the  features  encode  events  on  the  order  of  5 
ms).  We  are  primarily  interested  in  features,  but  also  in  representations  that  encode  larger  events. 
Therefore,  we  constructed  a  second  level  network  that  took  as  its  input  the  hidden  unit  representation 
derived  from  the  first  network  and  identity  mapped  it  to  an  output  layer.  This  second  network  had  400 
input  units,  10  hidden  units,  and  400  output  units.  It  was  able  to  look  at  representations  corresponding 
to  100  ms  (i.e„  it  saw  20  groups  of  20  hidden  unit  inputs  at  a  time,  and  each  group  of  20  hidden  units 
came  from  5  ms  of  speech  in  the  fust  network). 

The  first  network  in  this  two-level  system  was  the  one  trained  on  the  extended  speech  corpus,  and 
had  its  weights  fixed.  It  was  given  a  repeated  sequence  of  nonsense  syllables,  [ba]  [ba]  [ba]  [ba]  (in 
PCM  form),  as  a  continuous  stream  of  nonoverlapping  5  ms  inputs.  Every  5  ms  the  20  hidden  unit 
activations  from  this  fust  network  were  passed  to  the  input  layer  of  the  second  network.  After  400 
such  units  had  been  collected  by  the  second  network,  it  did  one  cycle  of  learning  in  an  identity  map¬ 
ping  mode.  On  each  succeeding  learning  cycle  the  input  pattern  in  the  second  network  was  shifted  left 
by  20  units,  and  a  new  20  units  were  received  from  the  fust  network.  This  sequence  of  events  contin¬ 
ued  until  approximately  800,000  learning  cycles  had  occurred  in  the  second  network  (remember  that  the 
fust  network  had  already  been  trained  and  was  not  subject  to  learning;  it  was  simply  acting  as  a  pre¬ 
processor  that  formatted  the  speech  in  a  manner  analogous  to  the  FFT  used  previously). 

After  the  second  network  was  trained  on  the  hidden  unit  representations,  we  froze  the  weights  in  the 
second  network.  We  then  ran  the  input  through  both  networks  in  a  continuous  stream.  As  we  did  this, 
we  examined  the  pattern  of  10  hidden  unit  activations  in  the  second  network.  These  hidden  unit  pat¬ 
terns  indicate  the  syllable  onsets.  Our  goal  now  is  to  explore  the  usefulness  of  this  higher  level  of 
representation  for  recognizing  larger  sets  of  speech  units,  although  at  this  point  we  remain  noncommital 
about  what  those  speech  units  will  be. 


CONCLUSION 

The  series  of  experiments  reported  here  are  clearly  preliminary  in  nature.  There  are  a  large  number 
of  questions  that  are  raised  by  this  work,  and  it  is  easy  to  think  of  many  alternative  ways  of  posing  the 
problems  we  have  presented  the  networks.  Nonetheless,  we  find  this  approach  to  discovering  the  hidden 
structure  in  speech  exciting  for  a  number  of  reasons. 

Power  The  domain  of  speech  processing  is  an  extremely  difficult  one.  There  are  a  targe  number  of 
problems  of  both  practical  and  theoretical  nature  which  remain  unsolved.  We  believe  that  the  experi¬ 
ments  described  here  demonstrate  that  the  PDP  framework  and  the  back-propagation  method  for  learn¬ 
ing  are  extremely  powerful. 
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FIGURE  10  Spectrograms  of  the  oulpuis  of  the  10  (of  the  20)  individual  hidden  units,  from  the  network  of  Figure  9,  in  response 
to  the  "neural  network’  sentence. 


In  the  Tint  labeling  study  we  saw  that  it  was  possible  to  build  a  system  that  could  be  taught  to 
correctly  categorize  a  number  of  highly  confusable  phonetic  segments;  and  that  having  learned  this,  the 
network  was  able  to  generalize  the  categorization  to  novel  data.  We  are  optimistic  that  the 
performance — which  was  good — can  be  improved  with  refinements  in  the  technique.  We  are  particu¬ 
larly  impressed  with  the  fact  that  an  encoding  was  found  in  which  one  hidden  unit  became  active  when¬ 
ever  an  alveolar  stop  was  presented,  regardless  of  vocalic  context,  and  of  another  which  did  the  same 
for  velar  stops.  It  is  well  known  that  both  of  these  consonants  exhibit  a  great  deal  of  contextual  varia¬ 
bility,  and  the  spontaneous  discovery  of  an  invariant  feature  is  surprising  and  gratifying. 


Representations.  An  important  goal  in  this  work  was  to  study  the  representations  that  result  from 
applying  the  back-propagation  learning  algorithm  to  speech.  We  feel  this  is  an  important  area,  which 
we  have  just  begun  to  study.  In  some  cases  the  representations  that  are  discovered  are  intuitively  sensi¬ 
ble  and  easy  to  interpret  In  the  labeling  studies  the  Consonant/Vowel  distinction  was  encoded  by  a 
single  unit.  In  other  cases  we  saw  that  the  representation  itself  assumed  a  distributed  form,  with  groups 
of  hidden  units  participating  in  (for  example)  the  encoding  of  place  of  articulation.  It  is  interesting  that 
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the  specific  representations  may  vary  when  learning  experiments  are  replicated.  This  suggests  that  mul¬ 
tiple  networks  learning  the  same  data  it.  ,y  provide  a  richer  representation  than  any  single  network. 
Finally,  we  saw  in  the  PCM  identity  mapping  studies  that  the  algorithm  finds  solutions  that  are  highly 
expedient  The  spectral  decomposition  that  was  carried  out  was  clearly  tuned  for  speech;  the  majority 
of  units  had  responses  that  focused  on  several  regions  of  the  spectrum,  and  these  were  precisely  those 
regions  that  are  highly  relevant  for  speech. 

Innate  vs.  learned  representations.  We  do  not  believe  that  the  work  described  here  necessarily 
makes  strong  claims  that  the  perceptual  representations  by  humans  are  learned.  On  the  other  hand,  we 
believe  that  the  work  does  argue  against  making  strong  claims  that  such  representations  must  be  innate. 
The  tendency,  in  linguistics  perhaps  more  than  psychology,  has  been  to  assume  that  much  of  the 
representation  apparatus  used  in  processing  language  must  be  learned.  In  large  part,  that  is  because  it 
has  seemed  to  many  people  that  the  representations  are  complex  and  often  arbitrary  and  that  the  input 
data  available  to  the  language  learner  for  those  representations  is  impoverished. 

We  feel  that  this  study  encourages  the  belief  that  more  information  about  the  structure  of  speech  is 
extractable  from  the  input  than  has  been  supposed.  The  back -propagation  method  of  learning  may  not 
in  fact  be  what  is  used  by  humans.  Still,  it  at  least  demonstrates  that  one  relatively  simple  algorithm 
does  exist  that  is  capable  of  uncovering  a  great  deal  of  structure  in  a  small  sample  of  speech.  It  is  our 
hope,  based  on  these  preliminary  studies,  that  it  will  be  possible  to  construct  a  hierarchy  of  learning 
networks  that  will  spontaneously  learn  to  recognize  speech  using  only  extensive  examples  of  input 
speech,  loosely  synchronized  with  transcribed  text  We  further  hope  that  this  task  can  be  accomplished 
in  such  a  way  as  to  shed  light  on  the  actual  mechanisms  used  by  the  brain. 
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